We introduce a learning method called ``gradient-based reinforcement planning'' (GREP). Unlike traditional DP methods that improve their policy backwards in time, GREP is a gradient-based method that plans ahead and improves its policy before it actually acts in the environment. We derive formulas for the exact policy gradient that maximizes the expected future reward and confirm our ideas with numerical experiments.